Methods and systems for automated table detection within documents

ABSTRACT

Methods and systems for detecting tables within documents are provided. The methods and systems may include receiving a text of a document image that includes a plurality of words depicted in the document image. Feature sets may be calculated for the words and may contain one or more features of a corresponding word of the text. Candidate table words may then be identified based on the feature sets, and may then be used to identify a table location within the document image. In some cases, the candidate table words may be identified using a machine learning model.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation application of U.S. patent application Ser. No. 16/700,162 filed on Dec. 2, 2019, which claims priority to U.S. Provisional Patent Application No. 62/775,062 filed on Dec. 4, 2018. The entirety of each application is incorporated herein by reference for all purposes.

BACKGROUND

Many automatic document processing systems scan documents in as document images or store them as text. These systems may also recognize the text contained within document images using optical character recognition (OCR). By recognizing the text of the document image, the document processing system may be able to perform further analysis. For example, some documents contain information stored within tables that are relevant to understanding the document. Therefore, after recognizing the text of the document image, some document processing systems also attempt to identify tables and table layout information within the document.

SUMMARY

The present disclosure presents new and innovative systems and methods to detect tables within documents. In one example, a method is provided comprising receiving a text of a document image that includes a plurality of words depicted in the document image, calculating a plurality of feature sets for the plurality of words, wherein each feature set contains information indicative of one or more features of a corresponding word of the plurality of words, and identifying candidate table words among the plurality of words based on the feature sets. The method may further include identifying, with a clustering procedure, a cluster of candidate table words that correspond to a table within the document image and defining a candidate table location including a candidate table border of the table that contains the cluster of candidate table words.

In another example, the candidate table words are identified using a machine learning model. In a further example, the machine learning model is a recurrent neural network or a convolutional neural network.

In yet another example, the features of the feature set include one or both of a text feature and a spatial feature. In a further example, the text feature includes one or more features selected from the group consisting of: orthographic properties of the corresponding word, syntactic properties of the corresponding word, and formatting properties of the corresponding word. In a still further example, the spatial feature includes one or more features selected from the group consisting of: a nearby ruler line distance, a neighbor alignment measurement, and a neighbor distance measurement.

In another example, the candidate table border is defined as the rectangle with the smallest area that contains the cluster of candidate table words. In a still further example, the clustering procedure is a density-based spatial clustering of applications with noise (DBSCAN) procedure. In yet another example, the candidate table border contains one or more words of the text that are not candidate table words.

In a further example, the method further comprises predicting a reading order for at least a subset of the words of the text. In another example, the text further includes a location for the subset of words, and predicting the reading order further comprises assigning a first word of the subset of words as coming before a second word of the subset of words in the reading order if one or more of the following conditions are true: (i) the second word is below the first word according to the location of the first and second words, or (ii) the second word is at the same height as the first word and is positioned to the right of the first word according to the location.

In another example, the method further comprises receiving a training text of a training document, including a plurality of words depicted in the training document and a labeled document image indicating a labeled table location of a table within the training document, calculating a plurality of training feature sets for the words of the training text, wherein each training feature set contains information indicative of one or more features of a corresponding word of the plurality of words of the training text, and identifying, with the machine learning model, candidate training table words of the training text among the words of the training text based on the training feature sets. In yet another example, the method further comprises identifying, with the clustering procedure, a cluster of candidate training table words that correspond to a table within the document image, defining a candidate training table location including a candidate training table border of the training table that contains the cluster of candidate training table words, comparing the candidate training table location with the labeled table location to identify a table location error of the candidate training table location, and updating one or more parameters of the machine learning model based on the table location error.

In a further example, the machine learning model is initially configured to identify table locations in document images from documents of a first document type and updating one or more parameters of the machine learning model enables the machine learning model to identify table locations in document images from documents of a second document type.

In another example, a system is provided comprising a processor and a memory. The memory may contain instructions that, when executed by the processor, cause the processor to receive a text of a document image that includes a plurality of words depicted in the document image, calculate, with a feature set calculator, a plurality of feature sets for the words, wherein each feature set contains information indicative of one or more features of a corresponding word of the plurality of words, and identify, with a text classifier, candidate table words among the plurality of words based on the feature sets. The memory may contain further instructions that, when executed by the processor, cause the processor to identify, with a clustering procedure of a table recognizer, a cluster of candidate table words that correspond to a table within the document image, and define, with the table recognizer, a candidate table location including a candidate table border of the table that includes the cluster of candidate table words.

In another example, the text classifier includes a machine learning model configured to identify the candidate table words. In a further example, the machine learning model is a recurrent neural network or a convolutional neural network.

In a still further example, the features of the feature set include one or both of a text feature and a spatial feature. In yet another example, the text feature includes one or more features selected from the group consisting of: orthographic properties of the corresponding word, syntactic properties of the corresponding word, and formatting properties of the corresponding word. In a further example, the spatial feature includes one or more features selected from the group consisting of: a nearby ruler line distance, a neighbor alignment measurement, and a neighbor distance measurement.

In another example, the candidate table border is defined as the rectangle with the smallest area that contains the cluster of candidate table words. In a further example, the clustering procedure is a density-based spatial clustering of applications with noise (DBSCAN) procedure. In a still further example, the candidate table border contains one or more words of the text that are not candidate table words.

In yet another example, the memory contains further instructions which, when executed by the processor, cause the processor to predict, with a reading order predictor, a reading order for at least a subset of the words of the text. In a further example, the text further includes a location for the subset of words, and the memory contains further instructions which, when executed by the processor, cause the processor to assign, with the reading order predictor, a first word of the subset of words as coming before a second word of the subset of words in the reading order if one or more of the following conditions are true: (i) the second word is below the first word according to the location of the first and second words, or (ii) the second word is at the same height as the first word and is positioned to the right of the first word according to the location.

In a still further example, the system further comprises a training system configured, when executed by the processor, to receive a training text of a training document, including a plurality of words depicted in the training document and a labeled document image indicating a labeled table location of a table within the training document, calculate a plurality of training feature sets for the words of the training text, wherein each training feature set contains one or more features of a corresponding word of the plurality of words of the training text, identify, with the machine learning model, candidate training table words of the training text among the words of the training text, and identify, with the clustering procedure, a cluster of candidate training table words that correspond to a table within the document image. The training system may be further configured, when executed by the processor, to define a candidate training table location including a candidate training table border of the training table that contains the cluster of candidate training table words, compare the candidate training table location with the labeled table location to identify a table location error of the candidate training table location, and update one or more parameters of the machine learning model based on the table location error.

In yet another example, the machine learning model is initially configured to identify table locations in document images from documents of a first document type and updating one or more parameters of the machine learning model enables the machine learning model to identify table locations in document images from documents of a second document type.

In a further example, a computer-readable medium is provided that contains instructions which, when executed by a processor, cause the processor to receive a text of a document image that includes a plurality of words depicted in the document image, calculate, with a feature set calculator, a plurality of feature sets for the words, wherein each feature set contains information indicative of one or more features of a corresponding word of the plurality of words, and identify, with a text classifier, candidate table words among the plurality of words based on the feature sets. The computer-readable medium may also contain instructions which, when executed by a processor, cause the processor to identify, with a clustering procedure of a table recognizer, a cluster of candidate table words that correspond to a table within the document image, and define, with the table recognizer, a candidate table location including a candidate table border of the table that includes the cluster of candidate table words.

The features and advantages described herein are not all-inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the figures and description. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and not to limit the scope of the inventive subject matter.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 illustrates a document processing system according to an example embodiment of the present disclosure.

FIG. 2A illustrates a table feature description according to an exemplary embodiment of the present disclosure.

FIG. 2B illustrates an example feature set—text association according to an exemplary embodiment of the present disclosure.

FIG. 3 illustrates a flow chart of a method according to an exemplary embodiment of the present disclosure.

FIGS. 4A and 4B illustrate an example named entity recognition procedure according to an example embodiment of the present disclosure.

FIG. 5 illustrates a flow diagram of an example method according to an example embodiment of the present disclosure.

DETAILED DESCRIPTION OF EXAMPLE EMBODIMENTS

One growing area of automated document processing is the automated analysis of documents, including legal documents. For example, automated tools, such as those from Leverton GmbH, can be used to automatically review large numbers of contracts, leases, title deeds, invoices, and other legal or financial documents during a due diligence process. These documents may include tables, and information within those tables may be important to understanding the content within the documents. Further, other information from the table, such as layout information of the table itself, may also be important to understanding the documents. To automate the analysis of such documents, an important step is to identify tables within the documents. For example, financial documents may include tables containing important figures (e.g., invoice amounts, total quantity amounts, order dates). Total invoice and quantity amounts may be useful for understanding the content of, e.g., a purchase order or invoice. In another example, a contract may identify a party or parties to the agreement in a table, such as a table that includes the parties' names and addresses. Properly attributing the provided names, addresses, and contact information to each party may require detecting the table, so that the document processing system can incorporate table layout information (e.g., rows, columns, and row/column labels).

Conventional methods of identifying tables in documents generally rely on heuristics developed and refined over time for specific document types. For example, conventional table detection methods may rely primarily on clustering methods to identify blocks of words that are aligned in a tabular fashion. These methods may then combine multiple blocks of words into tables. However, such heuristic methods often do not work well for difficult and variable table layouts, such as table layouts with inconsistent word alignments. Other difficult table layouts may include complex alignment patterns for rows and columns of the tables, extra offsets for headers of the table, inconsistent spacing between rows and columns, and tables with a large proportion of empty cells. Such difficult tables may be found in, e.g., lease contracts, lease amendments, indexation letters, and insurance contracts. Also, tuning these methods is generally very complicated, as each combination of heuristics generally relies on a large set of hyperparameters. Further, because these heuristics are generally developed for particular types of documents, this complexity amplifies the difficulty of tuning these models for new document types. For example, although certain financial documents, such as income statements and cash flow statements, may include similar information, their tabular layouts differ enough that identifying tables within documents of each type may require a separate heuristic. Accordingly, even if a conventional document processing system includes a heuristic for detecting tables within income statements, adding table detection for cash flow statements may require a new heuristic. In this way, developing and improving table detection for document processing systems can quickly become unduly cumbersome.

One innovative procedure described in the present disclosure to solve both of these problems is to receive a text of a document and create feature sets for the words of the text. These feature sets may include one or more features corresponding to words of the text. For example, the feature sets may include text features of a corresponding word or words of the text, such as orthographic properties, syntactic properties, and formatting properties of the corresponding word or words. The feature sets may also include spatial features of the corresponding words of the text, such as a nearby ruler line distance, a neighbor alignment measurement, and a neighbor distance measurement for the corresponding word or words.

A machine learning model may then identify candidate table words within the text based on the feature sets. These candidate table words may be words that the machine learning model has identified as potentially located within a table of the document. For example, candidate table words may be identified when the machine learning model predicts a probability higher than a given threshold that a word is located within a table of the document. In identifying the candidate table words, the machine learning model may analyze one or more of the feature sets for features that indicate the word is likely located within a table. For example, the machine learning model may analyze the feature sets individually. In other implementations, the machine learning model may analyze multiple feature sets at the same time for words that are near one another within the document, to better account for relationships and features common between multiple words.

A table may then be identified in the document that includes one or more of the candidate table words. For example, a table may be identified as the rectangle with the smallest area that encloses all of the candidate table words. In another example, the table may be defined as the rectangle with the smallest area that encloses a majority of the candidate table words, or all candidate table words with a certain estimated probability of being in a table. In other implementations, the candidate table words may first be analyzed using clustering procedures to identify one or more clusters of candidate words. In such implementations, the table may be identified similarly to the above-discussed methods. For example, the table may be defined as the rectangle with the smallest area that encloses all of the candidate table words within a given cluster.

Detecting tables in this manner leads to several benefits. First, because the candidate table words are recognized by a machine learning model, rather than heuristics, the method is easier to establish and expand for new document types and new table layouts. Also, because the machine learning model is able to incorporate additional properties, such as text features and spatial features, systems utilizing the above table detection procedures may be more robust across different document types. For example, although income statements and cash flow statements may have different table layouts, the type of text within these tables is often similar (e.g., financial amounts or numerical amounts). Accordingly, a machine learning model that is able to incorporate such text features will be more likely to recognize a table in a new document type, even though the table layout differs between documents, if the text is similar to text contained in previously identified tables. Additionally, the machine learning model may account for the text's orthographic properties, which current heuristic systems are not capable of doing. This may lead to better recognition of difficult table layouts, such as nontrivial word alignment patterns.

FIG. 1 depicts a document processing system 100 according to an example embodiment of the present disclosure. The document processing system 100 includes a document 102 and a document intake system 104. The document intake system 104 includes an optical character recognizer 106, a table detection system 110, a CPU 136, and a memory 138. The optical character recognizer 106 further includes a text 108. The table detection system 110 includes a reading order predictor 112, a text classifier 126, a feature set creator 120, and a table recognizer 132. The reading order predictor 112 stores a read text 114, which includes the text 108 and a reading order 118. The feature set creator 120 stores feature sets 122, 124. The text classifier 126 includes a machine learning model 128 and stores a classified text 130. The table recognizer 132 includes a candidate table border 134.

The document intake system 104 may be configured to receive a document 102 and recognize the text 108 within the document 102. The document 102 may be stored on the memory 138 after the document 102 is received by the document intake system 104, before the text 108 is recognized. The document 102 may be received from a document server configured to store multiple documents. The document 102 may be a document image, such as a scanned image of a paper document. In some implementations, if the document 102 is a document image, the document intake system 104 may recognize the text 108. In such a case, the document intake system 104 may recognize the text 108 of the document 102 using the optical character recognizer 106. The optical character recognizer 106 may be configured to perform optical character recognition (OCR) on the document 102 to recognize a text 108 of the document 102. In other implementations, the document 102 may already have recognized and/or searchable text (e.g., a word processing document or a PDF with recognized text). In such a case, the document intake system 104 may not need to recognize the text 108 and may instead continue processing the document 102 and the text 108.

The document 102 may be a document of a particular document type. For example, the document 102 may be a lease agreement, a financial document, an accounting statement, a purchase and sale agreement, a title insurance document, a certificate of insurance, a mortgage agreement, a loan agreement, a credit agreement, an employment contract, an invoice, or a news article. Although depicted in the singular, in some implementations the document intake system 104 may be configured to receive and process more than one document 102 at a time. For example, the document intake system 104 may be configured to receive multiple documents of the same type (e.g., accounting statements) or may be configured to receive multiple documents of multiple types (e.g., leases and accounting documents).

The table detection system 110 may be configured to analyze the document 102 and the text 108 to detect tables by identifying candidate table borders 134 indicating the locations of the tables within the document 102. For certain documents, or pages within certain documents where common table formatting is involved, the table detection system 110 may utilize conventional methods (e.g., heuristic methods as discussed above) to detect tables within the text 108 in the document 102. For example, when initially processing documents of a new document type, the document intake system 104 may not have sufficient example documents of the new document type to train the machine learning model 128 of the text classifier 126 to accurately classify the text 108 to create a classified text 130. Therefore, it may be more accurate for the table detection system 110 to rely on conventional methods. But, in other instances where the document 102 contains uncommon formatting, or where it is determined that conventional methods are unsuccessful, the table detection system 110 may instead be configured to predict a reading order 118 with the reading order predictor 112, create feature sets 122, 124 using the feature set creator 120, classify the text 108 to create a classified text 130, and recognize a candidate table border 134 with the table recognizer 132 based on the classified text 130. In other implementations, the table detection system 110 may forego conventional methods and solely use the latter method to detect tables within the text 108 of the document 102.

The reading order predictor 112 may be configured to predict a reading order 118 of the text 108. After predicting the reading order 118, the reading order predictor 112 may generate a read text 114 that includes the text 108 and the reading order 118. The read text 114 may be used later in processing, for example by the feature set creator 120 and the text classifier 126. The reading order predictor 112 may be configured with multiple reading order prediction procedures. For example, the reading order predictor 112 may predict the reading order 118 for documents 102 of different document types (e.g., income statements or cash flow statements) with different reading order prediction procedures. In other embodiments, the reading order predictor 112 may use the same reading order prediction procedure for documents of all document types. In certain implementations, the text 108 may also include location coordinates for each word of the text 108 within the document 102. Using these location coordinates, the reading order predictor 112 may predict the reading order 118 by assigning words that are to the right of and/or below a given word as coming later in the reading order 118. For example, the reading order predictor 112 may proceed through each word of the text 108 and may assign each word of the text 108 as coming after the word to its left. When the reading order predictor 112 reaches the end of a line of the text 108, it may proceed to the next line, and may assign the first word of the next line as coming after the last word of the preceding line within the reading order 118. In certain implementations, the optical character recognizer 106 may determine a reading order 118 accurate enough for use by the table detection system 110. In such instances, the reading order predictor 112 may not be necessary and/or may use the reading order 118 identified by the optical character recognizer 106.
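
To make the coordinate-based rule above concrete, the following is a minimal Python sketch of such a reading order predictor. It is illustrative rather than the disclosed implementation: the Word record, its top-left bounding-box coordinates, and the line_tolerance parameter (for deciding when two words share a line) are all assumptions.

```python
from typing import List, NamedTuple

class Word(NamedTuple):
    text: str
    x: float  # left edge of the word's bounding box (e.g., in pixels)
    y: float  # top edge of the word's bounding box

def predict_reading_order(words: List[Word], line_tolerance: float = 5.0) -> List[Word]:
    """Order words top-to-bottom, then left-to-right within a line.

    Words whose vertical coordinates fall into the same `line_tolerance`
    bucket are treated as being on the same line (hypothetical parameter).
    """
    # Quantize y so words on roughly the same line share a sort key,
    # then break ties left-to-right.
    return sorted(words, key=lambda w: (round(w.y / line_tolerance), w.x))

# Example: a header line followed by a table entry below it.
words = [Word("Amount", 120, 10), Word("Date", 10, 12), Word("$145", 120, 40)]
print([w.text for w in predict_reading_order(words)])  # ['Date', 'Amount', '$145']
```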

The feature set creator 120 may be configured to create feature sets 122, 124. The feature sets 122, 124 may each correspond to one or more words from the text 108. For example, the feature sets 122, 124 may each correspond to an individual word from the text 108. In creating the feature sets 122, 124, the feature set creator 120 may receive the text 108 and/or the read text 114. The feature set creator 120 may then process the words of the text 108 to calculate one or more text features and spatial features. For example, the feature set creator 120 may calculate text features such as orthographic properties, syntactic properties, and formatting properties of the corresponding text. The feature set creator 120 may also calculate spatial features such as a nearby ruler line distance, a neighbor alignment measurement, and a neighbor distance measurement. These features are discussed in greater detail below in connection with FIGS. 2A and 2B. In certain implementations, where the feature set creator 120 receives the read text 114, the feature set creator 120 may utilize the reading order 118 in calculating the feature sets 122, 124. For example, the feature set creator 120 may use the reading order 118 to identify nearby words within the text 108, and may incorporate aspects of the nearby words in calculating a feature set 122, 124.

The text classifier 126 may be configured to analyze the text 108 and the feature sets 122, 124 to create a classified text 130 that identifies candidate table words 140, 142 within the text 108. In certain implementations, the text classifier 126 may receive the read text 114 and may utilize both the text 108 and the reading order 118 in creating the classified text 130. For example, if a particular word is identified as a candidate table word 140, 142, nearby words in the reading order 118 may also be likely candidate table words. Accordingly, the text classifier 126 may take this information into account.

The text classifier 126 may be configured to use a machine learning model 128 to analyze the feature sets 122, 124 and classify words of the text 108 to identify candidate table words 140, 142. In doing so, the text classifier 126 may create a classified text 130 containing the candidate table words 140, 142. The machine learning model 128 may include, e.g., a convolutional neural network and/or a recurrent neural network. For example, because the machine learning model 128 may be analyzing a series of words within the text 108 in order to create the classified text 130, recurrent neural networks may be preferable. However, in other implementations it may be advantageous to combine both convolutional and recurrent networks (e.g., to better incorporate spatial information). The classified text 130 may include each word from the text 108 of the document 102, or may only include words surrounding or near candidate table words 140, 142. Limiting the classified text 130 to classifying such words may improve performance of the table detection system 110, but including all of the words of the text 108 to create the classified text 130 may enable greater accuracy (e.g., when there are multiple tables for detection within the document 102, or on a single page of the document 102). The candidate table words 140, 142 may include an indication of an estimated probability that the candidate table word 140, 142 is included within a table of the document 102. For example, the machine learning model 128 may generate an estimated probability that each word within the classified text 130 is included within a table of the document 102, and the candidate table word 140, 142 may store that estimated probability. In other implementations, the candidate table words 140, 142 may store only an indication that the word is likely located within a table of the document 102. For example, if the machine learning model 128 estimates a probability above a certain threshold (e.g., 50%, 75%) that a word is included within a table of the document 102, that word may be classified as a candidate table word 140, 142 and may be stored with an indication that the word is a candidate table word 140, 142 within the classified text 130. Similarly, the indication stored within the classified text 130 may also include the above-described estimated probability. Further, the estimated probability may include an estimation that the candidate table word is one or more of a table word, a table header, or a non-table word. For example, the classified text 130 may include an estimation that each word within the text 108 is a table word, a table header, or a non-table word. The classified text 130 may store a separate estimation for each of these three categories for all of the words of the text 108, and these estimations may sum to 100%.
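
As an illustration of the thresholding described above, the sketch below (an assumption-laden example, not the disclosed model) turns per-word probabilities over the three categories into candidate table words; the 0.5 threshold and the array layout are hypothetical choices.

```python
import numpy as np

CATEGORIES = ("table_word", "table_header", "non_table_word")

def classify_words(words, probabilities, threshold=0.5):
    """Label each word from an (n_words, 3) probability array whose rows
    sum to 1 across the three categories above. A word becomes a candidate
    table word when its combined table-word/table-header probability
    exceeds `threshold`."""
    classified = []
    for word, probs in zip(words, probabilities):
        in_table = float(probs[0] + probs[1])  # table word or table header
        classified.append({
            "word": word,
            "category": CATEGORIES[int(np.argmax(probs))],
            "is_candidate": in_table > threshold,
            "probability": in_table,
        })
    return classified
```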

The table recognizer 132 may be configured to analyze the classified text 130 to identify one or more candidate table borders 134 of tables contained within the document 102. In certain implementations, the table recognizer 132 may analyze each word of the classified text 130 to identify clusters of candidate table words 140, 142 within the classified text 130 and/or the text 108. The clusters of candidate table words 140, 142 may be identified using one or more clustering methods, such as density-based clustering methods, model-based clustering methods, fuzzy clustering methods, hierarchical clustering methods, and partitioning clustering methods. For example, the clusters of candidate table words 140, 142 may be identified using one or more of the following clustering methods: density-based spatial clustering of applications with noise (DBSCAN), balanced iterative reducing and clustering using hierarchies (BIRCH), clustering using representatives (CURE), hierarchical clustering, k-means clustering, expectation-maximization (EM), ordering points to identify the clustering structure (OPTICS), and mean-shift analysis. In certain implementations, the DBSCAN method may be preferred, as the DBSCAN method does not require cluster number pre-specification, is robust against outlier (i.e., false positive) candidate table word identification, and can find arbitrarily shaped or sized clusters, which may be tuned by specifying a minimum number of candidate table words to be included within each cluster. Once the cluster or clusters are identified, the table recognizer 132 may then calculate a candidate table border 134 identifying an estimated location of a table within the document 102. In certain implementations, the candidate table border 134 may be defined as the smallest rectangular border that contains all of the candidate table words 140, 142 within a given cluster. In this way, the table recognizer 132 may identify a separate table for each cluster of candidate table words 140, 142. In other implementations, such as for document types that rarely include more than one table on a page, the candidate table border 134 may be defined as the smallest rectangular border that encloses all of the candidate table words 140, 142 and clusters of candidate table words 140, 142 on a given page of the document 102. In still further implementations, the table recognizer 132 may utilize one or more heuristics to split and/or merge multiple clusters into a single cluster (e.g., representing a single table). For example, the table recognizer 132 may expand the candidate table border 134 to include one or more additional words of the text 108 located near the smallest rectangular border discussed above. Estimating the candidate table border 134 according to such heuristics may enable the table recognizer 132 to better correct and account for incorrectly classified words of the classified text 130 and to better account for errors (e.g., training errors of the machine learning model 128). Similarly, the clusters may be refined to include words of the text 108 whose locations (e.g., bounding boxes) are near or overlap with the border of a cluster.
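
A minimal sketch of the DBSCAN-plus-smallest-rectangle approach is shown below, using scikit-learn's DBSCAN. The coordinate representation (one center point per candidate word) and the eps/min_samples values are assumptions for illustration, not tuned values from the disclosure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def candidate_table_borders(candidate_words, eps=50.0, min_samples=4):
    """Cluster candidate table words by position and return one border
    per cluster.

    `candidate_words` is a list of (text, x_center, y_center) tuples in
    page coordinates. Points labeled -1 by DBSCAN are outliers (likely
    false-positive candidate words) and receive no border.
    """
    points = np.array([(x, y) for _, x, y in candidate_words])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)

    borders = {}
    for label in set(labels) - {-1}:
        cluster = points[labels == label]
        # Smallest axis-aligned rectangle containing every word center
        # in the cluster: (x_min, y_min, x_max, y_max).
        borders[label] = (cluster[:, 0].min(), cluster[:, 1].min(),
                          cluster[:, 0].max(), cluster[:, 1].max())
    return borders
```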

In certain implementations, the document 102 may include more than one page. For example, if the document 102 is a document image, the document 102 may include multiple images of the pages of the document 102. In such implementations, the document intake system 104 may analyze each page image individually. In other implementations, the document intake system 104 may analyze multiple pages of the document at the same time. In such implementations, for example, if the machine learning model 128 is implemented as a recurrent neural network, each of the multiple pages may be analyzed individually (e.g., in sequence). Alternatively, if the machine learning model 128 is implemented as a convolutional neural network (e.g., a spatial convolutional network), multiple pages may be analyzed by appending the images of each page to be analyzed together into a single image for analysis.

The CPU 136 and the memory 138 may implement one or more of the document intake system 104 features, such as the optical character recognizer 106 and the table detection system 110, including the reading order predictor 112, the feature set creator 120, the text classifier 126, and the table recognizer 132. For example, the memory 138 may contain instructions which, when executed by the CPU 136, may perform one or more of the operational features of the document intake system 104.

FIG. 2A depicts a table feature description 200 according to an exemplary embodiment of the present disclosure. The description 200 includes a table 202 containing a plurality of rows and columns. The table 202 includes four rows, with the first row identifying the information stored in each column (i.e., column headers). Namely, the first column corresponds to a date of the entry, the second column corresponds to an amount (i.e., a dollar amount) for the entry, and the third column corresponds to the total quantity. The table 202 may be an example of an order form, invoice, or other summary of purchases between two parties, with the quantity column indicating the total quantity of goods purchased on the date and the amount indicating the total cost of the goods ordered on that date. The table 202 includes multiple ruler lines separating the rows and columns of the table 202. As depicted, ruler lines 216, 218 separate the columns of the table 202, and the ruler lines 212, 214 separate the rows. Additional ruler lines are depicted in the table 202, but are not identified with reference numbers to reduce complexity. The features identified in the example 200 are depicted relative to the word 204 corresponding to "$145" in the amount column (i.e., the amount entry for the third row of the table 202). The features themselves will be discussed in greater detail below.

FIG. 2B depicts an example feature set—text association 240 according to an exemplary embodiment of the present disclosure. The association 240 includes a text 242 and a feature set 244 associated with the text 242. The text 242 may include one or more words within a text 108 of a document 102. For example, the association 240 may be an example of the association between the feature set 122, 124 and a word from the text 108 and/or the read text 114. The association 240 may also indicate that the feature set 244 was created, at least in part, from features derived from the text 242.

As mentioned above, the feature set 244 may be configured to store text features and spatial features of the corresponding text 242. These features may be features solely of the text 242, or may also include feature information of words near the text 242 (e.g., features of the text 242 as compared to similar features of nearby words and document features). As depicted, the feature set 244 includes text features such as orthographic properties 246, syntactic properties 248, and formatting properties 250. The feature set 244 as depicted also includes spatial features such as a nearby ruler line feature 252, a neighbor alignment 254, and a neighbor distance 256.

The orthographic properties 246 may include specific identifiers indicating whether the corresponding text 242 contains certain orthographic features. For example, the orthographic properties 246 may include numbers, symbols (e.g., financial, legal, scientific, or engineering symbols), camel case letters, capitalized words, words with all capitalized letters, and words with all lowercase letters. In certain implementations, the orthographic properties 246 may be stored as a one-hot encoded vector. For example, the orthographic properties 246 may be configured as a vector whose dimension is the same as the number of orthographic features being identified, where each position within the vector corresponds to a particular orthographic feature at issue. In such an example, a particular orthographic feature may be identified within the vector as a "1" in the corresponding position, with "0" values stored at the positions corresponding to the other orthographic features that are not present. The orthographic properties 246 may be directly calculated from a string or vector of the text 242. In certain implementations, or for certain document types, certain orthographic features may indicate a greater likelihood that the word is contained within a table 202 of the document 102. For example, numbers, especially numbers preceded by a financial symbol such as "$," may be more likely to occur within a table 202 for particular document types (e.g., financial documents). Thus, orthographic properties 246 calculated for text 242 from financial documents may likely include an indication of whether the text 242 is a number, or is a financial symbol such as "$." The orthographic properties 246 of the word 204 may indicate that the word 204 contains numerals, which may suggest its presence within a table 202. Similarly, the orthographic properties 246 may indicate that the word 204 includes a financial symbol (i.e., "$"), which may also suggest its presence within a table 202.
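
By way of illustration, a small Python sketch of such an encoding follows; the particular feature list, regular expressions, and symbol set are hypothetical stand-ins for whatever orthographic features a given implementation identifies.

```python
import re

ORTHOGRAPHIC_FEATURES = (
    "contains_number", "contains_financial_symbol", "camel_case",
    "capitalized", "all_caps", "all_lowercase",
)

def orthographic_vector(word: str) -> list:
    """Encode the presence of each orthographic feature as a 0/1 entry,
    one position per feature in ORTHOGRAPHIC_FEATURES."""
    checks = [
        bool(re.search(r"\d", word)),                  # contains a digit
        any(symbol in word for symbol in "$€£%"),      # financial symbol
        bool(re.match(r"[a-z]+[A-Z]", word)),          # e.g., "camelCase"
        word[:1].isupper() and not word.isupper(),     # Capitalized word
        word.isalpha() and word.isupper(),             # ALL CAPS
        word.isalpha() and word.islower(),             # all lowercase
    ]
    return [1 if check else 0 for check in checks]

print(orthographic_vector("$145"))  # [1, 1, 0, 0, 0, 0]
```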

The syntactic properties 248 may include an indication of typical usage or associated words of the corresponding text 242. For example, the syntactic properties 248 may include an embedding vector, i.e., a word-to-vector representation of a word or words of the corresponding text 242. The embedding vector may include information regarding the semantics of the text, e.g., similar words, commonly associated words, relevant subject areas, and word origins. The embedding vector may be provided by a third party and may be stored in the memory 138. The information contained in the embedding vector may assist the text classifier 126 or the machine learning model 128 in classifying the words of the text 108 as candidate table words 140, 142. For example, certain words may be common in headings for tables in specific documents or document types (e.g., "table," "date," "revenue," "amount," "total"). The syntactic properties 248 of these words may indicate that these words are commonly used as headings for such tables. Similarly, certain words may be common within cells of tables in certain documents or document types (e.g., product names, contact information, accounting terms, financial statement items). The syntactic properties 248 of these words may likewise indicate that these words are commonly used within tables. The syntactic properties 248 of the word 204 may indicate that "$145" is a dollar amount, although in certain implementations syntactic properties 248 may not be available or useful for strictly numeric portions of a text 108.
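
A lookup of such an embedding vector might resemble the following sketch; the in-memory table, vector dimension, and zero-vector fallback for out-of-vocabulary tokens are all assumptions (a real system would load a pre-trained word2vec/GloVe-style file from the third-party provider).

```python
import numpy as np

# Hypothetical pre-trained embedding table; in practice this would be
# loaded from a third-party word2vec/GloVe-style file into the memory 138.
EMBEDDINGS = {
    "total":  np.array([0.21, -0.53, 0.88]),
    "amount": np.array([0.19, -0.47, 0.91]),
}
EMBEDDING_DIM = 3

def embedding_vector(word: str) -> np.ndarray:
    """Return the word-to-vector representation of `word`, falling back
    to a zero vector for out-of-vocabulary tokens such as "$145"."""
    return EMBEDDINGS.get(word.lower(), np.zeros(EMBEDDING_DIM))
```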

The formatting properties 250 may indicate whether certain specified types of formatting are applied to the corresponding text 242. In instances where the corresponding text 242 includes only one word, the formatting properties 250 may indicate the formatting of that word. In other embodiments where the corresponding text 242 includes multiple words, the formatting properties 250 may include formatting applied to a subset of those words (e.g., applied to any word or a majority of the words of the corresponding text 242) or may include formatting common to all of the words of the corresponding text 242. Applied formatting included within the formatting properties 250 may include whether the text 242 is bolded, italicized, underlined, or struck through. The formatting properties 250 may also include font sizes and fonts of the corresponding text 242. The formatting properties 250 may be stored as a one-hot encoded vector, similar to the discussion above in connection with the orthographic properties 246. For example, the one-hot encoded vector may include the formatting features specified for identification, and may include a "1" for formatting features present within the corresponding text 242 and a "0" for those features that are not included in the corresponding text 242. In certain implementations, or for certain document types, certain formatting properties 250 may indicate a greater likelihood that the corresponding text 242 is included within a table 202. For example, bolded, italicized, or underlined formatting may be more likely to be used for heading text of a table 202, and may therefore suggest that a word is included within a table 202. Similarly, a larger font size relative to the text below the corresponding text may also indicate that the word is used as a header in a table 202. Analogously, text smaller than text used throughout a document 102 may indicate that the corresponding words are located within a table 202.

Although not depicted, the features may also include a document type of the document 102, which may be one-hot encoded in each feature set 244, similar to the one-hot encoding discussed above in connection with the orthographic and formatting properties 246, 250. Additionally, the features may include confirmed candidate table word information for words that have already been identified as candidate table words within the document 102. Confirmed candidate table word information may assist the machine learning model 128 with identifying new candidate table words, as candidate table words likely cluster together to form a table within the document 102 in some implementations.

The nearby ruler line feature 252 may include an indication of the presence of ruler lines within the document 102 near the corresponding text 242. For example, the presence of ruler lines may be encoded as a four-dimensional vector within the feature set 244, where each dimension includes a Boolean that corresponds to one direction (i.e., up, down, left, or right) and encodes whether a ruler line is nearby (e.g., within a certain distance of the corresponding text 242) in the given direction. As another example, if the layout information is stored as a four-dimensional vector, with entries in the vector corresponding to the directions (up, down, left, right), a line below the corresponding text 242 may be encoded as (0, 1, 0, 0). In certain document types (e.g., an invoice), ruler lines in one or more directions may indicate a greater likelihood that the corresponding text 242 is part of a table 202, with more nearby ruler lines indicating an increased likelihood that the corresponding text 242 is part of a table 202. The nearby ruler line feature 252 for the word 204 may indicate that there are ruler lines 212, 214, 216, 218 nearby and above, below, to the left of, and to the right of the word 204, respectively. In implementations where the nearby ruler line feature 252 is implemented as a one-hot encoded vector, that vector may be represented as (1, 1, 1, 1). Given that there are nearby ruler lines in all four directions, the nearby ruler line feature 252 may provide a strong indication that the word 204 is included within a table 202 of the document 102. In another implementation, rather than a one-hot encoded vector, the nearby ruler line feature 252 may be implemented as a vector indicating the distance to the nearest ruler line in each direction. For example, a vector represented as (15, 0, 0, 0) may indicate that there is a ruler line 15 pixels above the word 204. In another implementation, the vector may be normalized by a fixed value (e.g., normalized by 100 pixels). In such implementations, the previously-discussed vector may alternatively be represented as (0.15, 0, 0, 0).
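
The normalized-distance variant might be computed as in the sketch below, assuming the distance to the nearest ruler line in each direction has already been measured; representing directions with no nearby line by omission is a hypothetical convention.

```python
def ruler_line_vector(distances, max_distance=100.0):
    """Encode nearby ruler lines as an (up, down, left, right) vector of
    distances normalized by `max_distance`.

    `distances` maps "up"/"down"/"left"/"right" to the pixel distance of
    the nearest ruler line in that direction; absent directions encode
    as 0.0.
    """
    return tuple(distances.get(direction, 0.0) / max_distance
                 for direction in ("up", "down", "left", "right"))

# A ruler line 15 pixels above the word 204, nothing nearby otherwise:
print(ruler_line_vector({"up": 15.0}))  # (0.15, 0.0, 0.0, 0.0)
```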

The neighbor alignment 254 may include an indication of the alignment of the corresponding text 242 relative to its neighbors. For example, the neighbor alignment 254 of the word 204 may indicate that the word 204 is center aligned along the center axis 210 relative to its neighbors (e.g., "Amount," "$100," and "$250"). In this example, the neighbor alignment 254 is measured relative to a center alignment along the center axis 210, but other alignment measurements are possible, such as left alignment (e.g., along the left axis 206) or right alignment (e.g., along the right axis 208). Different alignment measurements may be useful for tables in different document types, or within different columns of single tables. For example, in certain document types, numbers may typically be center-aligned with one another, but text entries may be left-aligned. Accordingly, the neighbor alignment 254 may include measurements along all three axes in order to best account for differing information types within a single table 202. The neighbor alignment 254 for the word 204 may show a strong center alignment between the word 204 and its neighbors, which may provide a strong indication that the word 204 is included within a table 202.

The neighbor distance 256 may include a comparison of the distance between the corresponding text 242 and text neighboring the corresponding text 242, as compared to the median distance between neighboring words within the document 102. The neighbor distance 256 may compare the horizontal distance between horizontal neighbors, and may also or alternatively compare the vertical distance between vertical neighbors. For example, the word 204 has two neighboring distances 222, 228, depicted relative to the two horizontal neighbors "Sep. 29, 2017" and "15." For certain documents or document types, a neighbor distance 256 larger than the median neighboring distance within the document 102 may indicate that the corresponding text 242 is likely contained within a table 202. This may be true of both horizontal and vertical neighboring distances.
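
The two neighbor measurements might be sketched as follows; the box representation (left and right x-coordinates) and the ratio-to-median formulation are illustrative assumptions rather than the disclosed formulas.

```python
import statistics

def neighbor_distance_feature(word_gaps, document_gaps):
    """Compare a word's gaps to its neighbors against the document-wide
    median gap. Ratios above 1.0 reflect the wider spacing typical of
    table cells."""
    median_gap = statistics.median(document_gaps)
    return [gap / median_gap for gap in word_gaps]

def neighbor_alignment_feature(word_box, neighbor_boxes):
    """Measure left, center, and right alignment offsets (in pixels)
    against the vertical neighbors above and below a word; smaller
    offsets mean tighter alignment along that axis, as with the
    center-aligned amounts in FIG. 2A. Each box is (x_left, x_right)."""
    left, right = word_box
    center = (left + right) / 2
    offsets = []
    for n_left, n_right in neighbor_boxes:
        n_center = (n_left + n_right) / 2
        offsets.append((abs(left - n_left),
                        abs(center - n_center),
                        abs(right - n_right)))
    return offsets
```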

In some implementations, the feature set 244 may be created by a feature set creator 120. In creating the feature set 244, the feature set creator 120 may analyze the corresponding text 242 to ascertain the features. The feature sets 122, 124, 244 may be stored as a one-dimensional array of floats, where each entry in the array corresponds to a feature of the feature set 244. The feature set creator 120 may interact with external systems, such as an embedding vector provider, to gather features associated with the feature set 244. The feature set creator 120 may also create multiple feature sets 244 for multiple words within the text 108, each corresponding to one or more words within the text 108. In certain implementations, accuracy of the table detection system 110 may be improved if a separate feature set 244 is calculated for each word of the text 108 to improve granularity.
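
Assembling the flat float array might then look like the sketch below, which simply concatenates the illustrative feature helpers from the preceding sections; the word record with its precomputed gap and ruler-line measurements is a hypothetical input.

```python
import numpy as np

def build_feature_set(word) -> np.ndarray:
    """Concatenate per-word features into one one-dimensional array of
    floats. `word` is a hypothetical record carrying the word's text and
    precomputed spatial measurements; the helpers are the sketches from
    the preceding sections."""
    parts = [
        orthographic_vector(word.text),                       # text features
        embedding_vector(word.text),
        ruler_line_vector(word.ruler_distances),              # spatial features
        neighbor_distance_feature(word.gaps, word.doc_gaps),
    ]
    return np.concatenate([np.asarray(p, dtype=np.float32) for p in parts])
```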

The features included in the feature set 244 may differ depending on the document 102 or the document type. For example, the features may be selected while training the machine learning model 128, as discussed below, to include the features identified as most relevant to the accuracy of machine learning model 128 predictions.

The feature set 244 may be used by the text classifier 126, e.g., analyzed by the machine learning model 128, to estimate the probability that the word or words corresponding to each feature set 244 constitute a candidate table word 140, 142. These candidate table words 140, 142 may then be used by the table recognizer 132 to recognize a table location, such as a candidate table border 134 of a table 202 of the document 102.

FIG. 3 depicts a flow chart of a method 300 according to an exemplary embodiment of the present disclosure. The method 300, when executed, may be used to predict a reading order 118 of a text 108 of a document 102, generate feature sets 122, 124, 244 for one or more words of the text 108, classify the text 108 to create a classified text 130 containing candidate table words 140, 142, and use the classified text 130 to identify a table location, such as a candidate table border 134, indicating a location of a table 202 within the document 102. The method 300 may be implemented on a computer system, such as the document intake system 104. For example, one or more steps of the method 300 may be implemented by the table detection system 110, including the reading order predictor 112, the feature set creator 120, the text classifier 126, and the table recognizer 132. For example, all or part of the method 300 may be implemented by the CPU 136 and the memory 138. Although the examples below are described with reference to the flowchart illustrated in FIG. 3, many other methods of performing the acts associated with FIG. 3 may be used. For example, the order of some of the blocks may be changed, certain blocks may be combined with other blocks, one or more of the blocks may be repeated, and some of the blocks described may be optional.

The method 300 may begin with the document intake system 104 receiving the text 108 of a document 102 (block 302). The document 102 may be associated with a document type that the document intake system 104 is configured to process, as described above. In certain implementations, the text 108 may be omitted and only the document 102 may be received, in which case the optical character recognizer 106 may perform OCR on the document 102 to recognize a text 108 of the document 102.

The table detection system 110 may then proceed to predict a reading order 118 of the text 108 (block 304). The reading order 118 may indicate the predicted order in which the words of the text 108 are read. Predicting the reading order 118 may help ensure that later processing of the text occurs in the proper order (i.e., in the order the words are read by a human reader or in the order that optimizes later processing accuracy by an automated document processing system), which may assist the table detection system 110 in properly understanding the document 102. As described above, the reading order predictor 112 may be configured with more than one reading order prediction procedure, and may use different procedures for different document types. One such reading order prediction procedure may include assigning a second word of the text 108 as coming after a first word of the text 108 if one of the following is true: (i) the second word is below the first word according to the words' vertical coordinates on a page of the document 102, or (ii) the second word is at the same or similar vertical coordinate as the first word, but is positioned to the right of the first word according to the words' horizontal coordinates on a page of the document 102. After predicting the reading order 118, the reading order predictor 112 may assemble a read text 114 that includes both the text 108 and the reading order 118. For example, the read text 114 may rearrange the words of the text 108 such that they occur in the order reflected in the reading order 118 predicted by the reading order predictor 112. Storing and providing the reading order 118 in this way may simplify processing later, e.g., by the feature set creator 120 and the text classifier 126.

The table detection system 110 may then proceed to calculate feature sets 122, 124, 244 for words of the text 108 (block 306). As described above, the feature sets 122, 124, 244 may in certain implementations each correspond to a single word of the text 108, but in other implementations may correspond to more than one word of the text 108. In certain implementations, the feature set creator 120 may create feature sets 122, 124, 244 for every word of the text 108, but in other implementations may only calculate feature sets 122, 124, 244 for a subset of the words of the text 108. In creating the feature sets 122, 124, 244, the feature set creator 120 may begin by selecting a first word or words of the text 108 (block 318). This first word may be, e.g., the first word in the reading order 118 on a given page of the document 102. For example, if the document 102 contains multiple pages, the feature set creator 120 may sequentially create feature sets 122, 124, 244 for the words on each page of the document 102 in the order that the pages appear in the document 102. In other implementations, the document intake system 104 and the table detection system 110, including the feature set creator 120, may be configured to process each page of the document 102 individually.

Next, the feature set creator 120 may calculate the text features for the selected word or words (block 320). The feature set creator 120 may calculate the orthographic properties 246 of the selected word or words, for example, by evaluating whether the selected word or words have certain orthographic features, as discussed above. The feature set creator 120 may calculate syntactic properties 248 of the selected word or words, for example, by looking up an embedding vector from an external embedding vector database or embedding vector provider, or by calculating the embedding vector internally within the document intake system 104, for example using a large corpus of text (e.g., a corpus of text collected from websites or publications such as Wikipedia). Similarly, the feature set creator 120 may calculate formatting properties 250 of the selected word or words by, for example, evaluating whether the selected word or words have certain formatting features, such as bolded text, underlined text, italicized text, or text of a larger or smaller font size than is typically used in the document 102 (e.g., than the median font size of the document 102 or of the current page of the document 102).

The feature set creator 120 may then proceed to calculate the neighbor features for the selected word or words, such as the neighbor alignment 254 and the neighbor distance 256 (block 322). The feature set creator 120 may calculate the neighbor alignment 254 by, for example, calculating the relative alignments of the neighboring words along a center axis 210, a left axis 206, and a right axis 208. Alternatively, the feature set creator 120 may only calculate one of these alignments, according to the features determined to be most useful while training the machine learning model 128. The feature set creator 120 may calculate the neighbor distance 256 by measuring the distance (e.g., in pixels, inches, centimeters, or characters) between the selected word or words and their horizontal and vertical neighbors. In certain implementations, as discussed above, the neighbor alignment 254 and the neighbor distance 256 may also include an indication of how these respective alignment and distance measurements compare to the median alignment and distance measurements within the document 102. In such implementations, the feature set creator 120 may first calculate the neighbor alignment 254 and/or the neighbor distance 256 for each word or words of the text 108 and may then calculate a median alignment or distance for the document 102 based on all of the calculated alignment and distance measurements. The feature set creator 120 may then return to the feature sets 122, 124, 244 for each word or words to calculate and add the comparison to the median alignment and distance measurements for the document 102. In other implementations, the feature set creator 120 may calculate the neighbor alignment 254 and the neighbor distance 256 relative to a fixed number (e.g., a fixed neighbor alignment or fixed neighbor distance). The value of the fixed number may be set, e.g., during a training process, such as training the machine learning model 128. The fixed number may vary depending on document type, font, and layout settings of the document. In other implementations, the comparison to the median alignment and distance measurements may not be included within the feature set 122, 124, 244 itself, but may instead be indirectly determined later in processing, e.g., by the text classifier 126 or the machine learning model 128.

The feature set creator 120 may then proceed to calculate the ruler features (block 324). For example, the feature set creator 120 may calculate the nearby ruler line feature 252 by searching the area around the selected word or words, e.g., the area between the selected word or words and the neighboring words. If ruler lines are detected in one or more directions in the area around the selected word or words, the feature set creator 120 may store an indication of the presence and direction of the ruler lines, e.g., as a one-hot encoded vector as discussed above.
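
A minimal sketch of such a ruler feature follows, assuming ruler lines have already been detected elsewhere in the pipeline as line segments (x0, y0, x1, y1); the search margin and segment representation are assumptions for this example:

```python
def ruler_line_feature(word_box, ruler_lines, search_margin=15):
    """Sketch of the nearby ruler line feature 252 as a one-hot style
    vector (above, below, left, right) relative to the word's box."""
    x0, y0, x1, y1 = word_box
    above = below = left = right = 0
    for lx0, ly0, lx1, ly1 in ruler_lines:
        horizontal = abs(ly0 - ly1) < 2
        if horizontal and lx0 <= x1 and lx1 >= x0:   # overlaps the word in x
            if y0 - search_margin <= ly0 <= y0:
                above = 1
            elif y1 <= ly0 <= y1 + search_margin:
                below = 1
        elif not horizontal and ly0 <= y1 and ly1 >= y0:  # overlaps in y
            if x0 - search_margin <= lx0 <= x0:
                left = 1
            elif x1 <= lx0 <= x1 + search_margin:
                right = 1
    return (above, below, left, right)
```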

The feature set creator 120 may omit calculation for certain of the above-discussed features 246, 248, 250, 252, 254, 256. In particular, certain features may be more or less advantageous for certain documents. For example, the formatting properties 250 may be less essential to accurately detect tables within documents 102 of certain document types, e.g., accounting statements, where large proportions of the document 102 include tables and therefore have similar formatting, making comparisons to median font size and typical formatting less informative. Similarly, a nearby ruler line feature 252 may be less essential to accurately detect tables within documents 102 of document types where tables do not generally include ruler lines, e.g., financial statements such as income statements. Therefore, in such implementations, to improve processing time, these aspects may be omitted from the calculated features of the feature set 122, 124, 244. Additionally, the features 246, 248, 250, 252, 254, 256 calculated by the feature set creator 120 may be selected during a machine learning model 128 training phase to include those features 246, 248, 250, 252, 254, 256 most important to the machine learning model 128's accurate operation.

The feature set creator 120 may then determine whether there are additional words for which a feature set 122, 124, 244 is required (block 326). For example, the feature set creator 120 may determine whether there are additional words within the text 108, or within the subset of the text 108 for which feature sets 122, 124, 244 are being calculated. If the feature set creator 120 determines that there are additional words, processing may continue by selecting the next word or words within the text 108 for processing (block 316). For example, the feature set creator 120 may select the next word or words in the reading order 118. Further, if the previous word being processed was at the end of a page of the document 102, the feature set creator 120 may select the first word on the following page according to the reading order 118. Alternatively, if the table detection system 110 and/or the document intake system 104 are configured to process each page of the document 102 individually, the feature set creator 120 may determine that there are no additional words for processing if the previously processed word is at the end of a page of the document 102. Once the next word or words are selected, processing may continue through blocks 320, 322, 324 until no further words for which feature sets 122, 124, 244 are to be calculated remain at block 326.

If the feature set creator 120 determines that there are no additional words, the text classifier 126 may then analyze the feature sets 122, 124, 244 (block 308). As described above, the text classifier 126 may analyze the feature sets with the machine learning model 128, which may predict the likelihood that the corresponding word or words are contained within a table 202 of the document 102. In particular, the machine learning model 128 may be configured to analyze each feature set 122, 124, 244 and estimate a probability that each corresponding word is contained within a table 202. In certain implementations, the machine learning model 128 may be trained to find patterns within feature sets corresponding to localized regions of words within the document 102. For example, the machine learning model 128 may be configured to analyze each feature set in connection with the feature sets of neighboring words to estimate the likelihood that the corresponding word is contained within a table 202. In still further implementations, the machine learning model 128 may be configured to analyze all of the feature sets 122, 124, 244 at the same time to identify broader spatial patterns within the document 102. In such implementations, it may be helpful to break the document 102 up by individual pages (or smaller groups of pages) and analyze each individually, to limit the associated processing requirements. The machine learning model 128 may accordingly incorporate additional spatial information beyond what is specifically calculated in the feature set 122, 124, 244. For instance, even if a feature set 122, 124, 244 does not expressly include the words near the cell being analyzed, a machine learning model 128 configured to analyze surrounding cells may determine and incorporate that information from the surrounding cells themselves. In certain implementations, the machine learning model 128 may be a recurrent neural network configured to analyze cell regions of the document 102. However, other machine learning model 128 implementations are possible, including a convolutional neural network. In such implementations, the features most important to the successful operation of the machine learning model 128 may include the neighbor alignment 254, the neighbor distance 256, and the orthographic properties 246.
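
The recurrent architecture is one option named in the disclosure; the following is a deliberately small illustrative stand-in for the machine learning model 128, in which the hidden dimension and bidirectional choice are assumptions:

```python
import torch
import torch.nn as nn

class TableWordClassifier(nn.Module):
    """Illustrative classifier: a bidirectional LSTM reads the per-word
    feature sets in reading order and emits, for each word, a
    probability that the word lies inside a table."""
    def __init__(self, feature_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.rnn = nn.LSTM(feature_dim, hidden_dim,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, feature_sets: torch.Tensor) -> torch.Tensor:
        # feature_sets: (batch, num_words, feature_dim)
        hidden, _ = self.rnn(feature_sets)
        return torch.sigmoid(self.head(hidden)).squeeze(-1)  # (batch, num_words)
```

Because the LSTM carries context across the sequence, each word's prediction reflects its neighbors' features even when those features are not explicitly copied into the word's own feature set, consistent with the localized-pattern analysis described above.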

As described in greater detail below, the machine learning model 128 may be trained to detect certain kinds of tables, and may be trained to analyze documents 102 of a specific document type. For example, one machine learning model 128 may be trained to analyze legal documents or specific types of legal documents (e.g., leases or purchasing agreements) and another machine learning model 128 may be configured to analyze financial documents or a certain type of financial document (e.g., invoices, accounting documents). For this reason, although only depicted in the singular, the text classifier 126 may include more than one machine learning model 128 and may use the machine learning models 128 to analyze certain documents or to detect specific kinds of tables.

In analyzing the feature sets 122, 124, 244, the text classifier 126 may first select a first feature set 122, 124, 244 (block 328). For example, the text classifier 126 may select the feature set 122, 124, 244 corresponding to the first word or words of the document 102, or the first word or words of the current page being processed from the document 102. The text classifier 126 may then examine the features from the selected feature set 122, 124, 244 (block 332). For example, the machine learning model 128 may analyze the text features 246, 248, 250 and the spatial features 252, 254, 256 of the feature set 122, 124, 244. In analyzing the feature set 122, 124, 244, the machine learning model 128 may identify one or more characteristics that increase the likelihood that the text 242 corresponding to the selected feature set 122, 124, 244 is included within a table 202. For example, as discussed above, certain orthographic features within the orthographic properties 246 of the corresponding text 242 may indicate a greater likelihood that the corresponding text 242 is included within a table 202 (e.g., financial symbols, numbers, dollar amounts, and dates in numerical format). Similarly, certain formatting features within the formatting properties 250 may indicate a greater likelihood that the corresponding text 242 is included within a table 202 (e.g., bolded text, underlined text, italicized text, differing font sizes). Further, certain spatial features, such as a highly consistent neighbor alignment 254, nearby ruler lines in one or more directions 252, and a consistent neighbor distance 256, may further indicate a greater likelihood that the corresponding text 242 is included within a table 202.

The machine learning model 128 may also compare the features in the selected feature set 122, 124, 244 to the features within feature sets 122, 124, 244 corresponding to nearby words (block 334). For example, although a combination of features within the selected feature set 122, 124, 244 may indicate a certain likelihood that the corresponding text 242 is included within a table 202, similar features reflected in feature sets 122, 124, 244 corresponding to nearby words may indicate an even stronger likelihood that the corresponding text 242 of the selected feature set 122, 124, 244 is included within a table 202. For example, words that are included within a table 202 may often occur near one another, and may tend to have consistent text features 246, 248, 250 (e.g., numerical characters, font sizes, and related word syntax) and spatial features 252, 254, 256 (e.g., presence of surrounding ruler lines, alignment with one another, and consistent distance between one another). Accordingly, such shared features between nearby words and the corresponding text 242 may provide a strong indication that these words are included within a table 202.

Based on the preceding analysis, the machine learning model 128 may then determine a table candidacy for the corresponding text 242 (block 336). The table candidacy determination may include, e.g., a predicted likelihood that the corresponding text 242 is included within a table 202 of the document 102.

The text classifier 126 and machine learning model 128 may omit the analysis performed at certain of the above-discussed blocks 332, 334, 336. For example, in certain implementations, the machine learning model 128 may not expressly compare the features of the selected feature set 122, 124, 244 with those of nearby words at block 334, e.g., because such features are not informative for tables contained within certain document types, or because such features may be adequately reflected within certain spatial features, such as the neighbor alignment 254 or the neighbor distance 256. Accordingly, to reduce processing time and demands, such analysis may be omitted.

The text classifier 126 and/or the machine learning model 128 may then determine whether there are additional feature sets for analysis (block 338). For example, the text classifier 126 may determine whether there are feature sets 122, 124, 244 corresponding to additional words within the text 108, or within the subset of the text 108 for which feature sets 122, 124, 244 were calculated. If there are additional feature sets 122, 124, 244 for analysis, the text classifier 126 and/or the machine learning model 128 may select the next feature set 122, 124, 244 for analysis (block 330). For example, the next feature set 122, 124, 244 may be the feature set 122, 124, 244 corresponding to the next word in the reading order 118. Once the next feature set 122, 124, 244 is selected, processing may continue through blocks 332, 334, 336 until there are no additional feature sets 122, 124, 244 remaining at block 338.

When no additional feature sets 122, 124, 244 remain for analysis, the text classifier 126 may then identify the candidate table words 140, 142 (block 310). As discussed above, the candidate table words 140, 142 may be selected as those words within the text 108 with a predicted likelihood of being within a table 202 that exceeds a certain threshold. For example, the candidate table words 140, 142 may include all words with a predicted likelihood greater than 50% or 75% of being contained within a table 202. The text classifier 126 may also create a classified text 130 that includes an indication of those words within the text 108 that are identified as candidate table words 140, 142. In certain implementations, the candidate table words 140, 142 may be identified using a one-hot encoding, such as a Boolean indicator corresponding to candidate table word status. In other implementations, the indication may include only the predicted likelihood from the machine learning model 128 that the words are included within a table 202.
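
The thresholding step itself is simple; a sketch follows, in which the output triple format is an assumption made for illustration:

```python
def identify_candidate_table_words(words, probabilities, threshold=0.5):
    """Sketch of block 310: mark words whose predicted in-table
    probability exceeds a tunable threshold, returning a classified
    text as (word, probability, is_candidate) triples."""
    return [(word, prob, prob > threshold)
            for word, prob in zip(words, probabilities)]
```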

The table recognizer 132 may then identify a cluster or clusters of candidate table words 140, 142 (block 312). In identifying clusters of candidate table words 140, 142, the table recognizer 132 may utilize both the table candidacy determination generated by the text classifier 126 and the location information concerning the candidate table words 140, 142. For example, the clusters may be identified as including all candidate table words that are located near one another, and in order to identify which candidate table words are near one another, location information regarding the candidate table words may be necessary. Additionally, certain documents 102 or document types may generally only include one table per page. In implementations concerning such documents 102 or document types, the table recognizer 132 may be configured to attempt to identify only a single cluster of candidate table words 140, 142 per page. However, other documents 102 or document types may occasionally have more than one table on certain pages. Accordingly, implementations processing these types of documents 102 may be configured to potentially identify more than one cluster per page of the document 102. For these reasons, the table recognizer 132 may include more than one cluster identification procedure. However, certain implementations may be adequately served by a single procedure, such as a density-based clustering method (e.g., DBSCAN). In still further implementations, the clusters may be identified by a second machine learning model, trained to identify clusters of candidate table words in documents of one or more document types (e.g., a machine learning model trained to co-reference the predictions of the text classifier 126, such as a convolutional or attention-based machine learning model).

The table recognizer 132 may then identify the location or locations of the table or tables 202 within the document 102 (block 314). In certain implementations, the table recognizer 132 may be configured to identify a candidate table border 134 for each table 202. For example, for each cluster of candidate table words 140, 142 identified by the table recognizer 132 at block 312, the table recognizer 132 may recognize a separate candidate table border 134, e.g., because each cluster of candidate table words 140, 142 may correspond to a single table within the document 102. Accordingly, the table recognizer 132 may recognize a candidate table border 134 as the smallest rectangular border that encloses each of the candidate table words 140, 142 within a cluster. In practice, identifying the candidate table border 134 in this way may include additional words from the text 108 that have not been expressly identified as candidate table words 140, 142. For example, it may be common in certain implementations for the first and final words of a table 202 to be incorrectly excluded as candidate table words 140, 142. However, it may also be typical in such examples for other words of the table 202 (e.g., words in the same row and/or column as the first and final words) to be accurately identified as candidate table words 140, 142. Therefore, these words are likely to be included in a cluster of candidate table words 140, 142, and identifying the table location as the candidate table border 134 discussed above will still include the first and final words by virtue of the cluster including candidate table words 140, 142 in the same row and/or column. In this way, the table recognizer 132 may be able to account for and adapt to errors by the machine learning model 128.
Once the table location has been accurately identified, processing may continue on the document 102, for example, so that an automated document processing system may analyze the content of the document 102.
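
DBSCAN and the smallest-enclosing-rectangle border are both named above; a sketch combining them follows, in which the eps and min_samples values, the use of scikit-learn, and clustering on box centers are assumptions for this example:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def candidate_table_borders(candidate_boxes, eps=40.0, min_samples=3):
    """Sketch of blocks 312-314: cluster candidate table words by the
    centers of their bounding boxes, then take each cluster's smallest
    enclosing rectangle as the candidate table border 134."""
    boxes = np.asarray(candidate_boxes, dtype=float)  # rows: (x0, y0, x1, y1)
    centers = np.column_stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                               (boxes[:, 1] + boxes[:, 3]) / 2])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centers)
    borders = []
    for label in set(labels) - {-1}:  # -1 marks DBSCAN noise points
        cluster = boxes[labels == label]
        borders.append((cluster[:, 0].min(), cluster[:, 1].min(),
                        cluster[:, 2].max(), cluster[:, 3].max()))
    return borders
```

Note that the enclosing rectangle naturally sweeps in non-candidate words that fall between clustered candidates, which is the error-tolerance behavior described above.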

All or some of the blocks of the method 300 may be optional. For example, the method 300 may be performed without executing one or more of blocks 316, 318, 320, 322, 324, 326 in connection with creating the feature sets 122, 124, 244 at block 306. Instead, processing at block 306 may continue directly to analyzing the feature sets at block 308. Similarly, the analysis at block 308 need not necessarily be performed by executing one or more of blocks 328, 330, 332, 334, 336, 338. Rather, processing from block 308 may proceed directly to identifying the candidate table words 140, 142 at block 310.

FIGS. 4A and 4B depict an example table detection procedure 400 according to an example embodiment of the present disclosure. In some implementations, the procedure 400 may be performed according to a method for predicting a reading order 118 of a text 108 of a document 102, generating feature sets 122, 124, 244 for one or more words of the text 108, classifying the text 108 to create a classified text 130 containing candidate table words 140, 142, and using the classified text 130 to identify a candidate table border 134 indicating an estimated location of a table 202 within the document 102, such as the method 300. As described in greater detail below, the steps performed in conjunction with the procedure 400 may be performed by the document intake system 104, the table detection system 110, the reading order predictor 112, the feature set creator 120, the text classifier 126, and/or the table recognizer 132.

The procedure 400 may begin in FIG. 4A with the document 402. Although similar steps may be used to analyze a document or a page of a document in its entirety, the depicted document 402 only includes a portion of a page for clarity in presentation. As can be seen, the depicted portion of the document 402 includes a text portion, reciting, "Your purchase records are recorded for your reference in Table 1 below." The document 402 also includes a table 428 including multiple words 404-426. Below the table 428 is a second text portion reciting, "Table 1." In certain implementations, a text 108 of the document 402 may be recognized by the optical character recognizer 106 as discussed above. Additionally, further accurate processing of the contents of the document 402 (e.g., processing the contents of the table 428) may depend on accurate detection of the table 428 within the document 402.

To accurately detect the table 428, the table detection system 110 may first predict a reading order 118 for the text 108 of the document 402. As discussed above, the reading order predictor 112 may use one of several reading order prediction procedures, including assigning a second word as coming after a first word of the text 108 if either (i) the second word is below the first word according to the words' vertical coordinates on a page of the document 402 (e.g., "Table 1" is below "recorded") or (ii) the second word is at the same or similar vertical coordinate as the first word but is positioned to the right of the first word according to the words' horizontal coordinates on a page of the document 402 (e.g., "recorded" is at the same vertical coordinate as "purchase," but is located to the right). In the two preceding examples, "Table 1" may be assigned as coming after the word "recorded," which is itself assigned after the word "purchase" in the reading order 118. As another example, word 408 may be assigned after word 406 according to condition (ii) above, and word 412 may be assigned after word 408 in the reading order 118 according to condition (i) above. The reading order predictor 112 may proceed in this way until all words within the text 108 of the document 402 are assigned as coming either before or after the other words of the text 108.
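
The two conditions reduce to a simple comparison; a sketch follows, assuming (x0, y0, x1, y1) word boxes with y increasing downward and an assumed tolerance for treating words as being on the same line:

```python
def comes_after(second, first, line_tolerance=5):
    """Sketch of the reading order rule: `second` follows `first` if it
    is on roughly the same line but to the right (condition ii), or
    otherwise below it on the page (condition i)."""
    same_line = abs(second[1] - first[1]) <= line_tolerance
    if same_line:
        return second[0] > first[0]  # condition (ii): to the right
    return second[1] > first[1]      # condition (i): below

# An entire page can be ordered with the same rule, e.g.:
#   word_boxes.sort(key=lambda b: (round(b[1] / 5), b[0]))
```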

Once the reading order 118 is predicted, the feature set creator 120 may create feature sets 122, 124, 244 for one or more words of the text 108. For example, the feature set creator 120 may create feature sets 122, 124, 244 for each of the words of the text 108. The feature sets 122, 124, 244 may include one or more text or spatial features of the corresponding text 242, as discussed above. The machine learning model 128 of the text classifier 126 may then analyze these feature sets 122, 124, 244 to determine whether the words of the text 108 are candidate table words. In the example depicted in FIG. 4A, the words identified as candidate table words are indicated with a dashed outline. Accordingly, as depicted, words 406-424 are identified by the machine learning model 128 as candidate table words.

In identifying the candidate table words 406-424, the machine learning model 128 may analyze one or more features of the corresponding feature sets 122, 124, 244. For example, word 418 recites "$145." Accordingly, the orthographic properties 246 of word 418 may indicate the presence of a financial symbol "$," along with the numerical figures "145." The presence of these figures may indicate that the word 418 is likely to be included within a table 428. Additionally, spatial features, such as the nearby ruler line feature 252, may indicate that there are ruler lines above, below, to the left, and to the right of the word 418 (e.g., a one-hot encoded vector of (1, 1, 1, 1)). As discussed above, these nearby ruler lines may provide a strong indication that the word 418 is included within a table. By contrast, ruler lines in only one direction may not be a strong indication, as ruler lines only above or below the corresponding text 242 may, for example, instead indicate that the corresponding text 242 is, respectively, a caption or title of the table 428, as with the text "Table 1" below the table 428. Additionally, the neighbor alignment 254 may indicate a strong center alignment with the above and below neighbors 412, 424 of the word 418. The neighbor distance 256 may also indicate a strong consistency of distance between the word 418 and its neighbors to the left and right. For example, the distance 222 between the word 418 and its left neighbor 416 as defined in FIG. 2A is very similar to the distances 220, 224 between neighboring words 412, 424 and their respective left neighbors 410, 422. Likewise, the distances 228, 226, 230 between the words 418, 412, 424 and their right neighbors 420, 414, 426 are very similar. The strong similarity between the neighbor distances 256 may also provide a strong indication that the word 418 is contained within a table 428. A similar analysis may be performed on words 412, 424 in the same column.

Other useful features may include a right alignment measurement within the neighbor alignment 254 and the formatting properties 250. For example, the words 410, 416, 422 are right aligned with one another. Accordingly, the neighbor alignment 254 for the word 416 may include an indication that these words are right aligned, which may also indicate that these words are co-located within a column, and therefore likely included within a table 428. Further, the formatting properties 250 and/or the orthographic properties 246 may indicate that these words 410, 416, 422 are dates that are formatted numerically. This usage may be typical in tables, and may be uncommon within text portions of certain document types. Accordingly, this formatting information may increase the likelihood that the words 410, 416, 422 are located within a table 428. Further, the words 404, 406, 408 are formatted with bolded text, which may be indicated in the formatting properties 250 of their corresponding feature sets 122, 124, 244. Bolded text may suggest that the text is a header within tables of certain document types, as is the case in the table 428, and may therefore increase the likelihood that the words 404, 406, 408 are contained within a table 428. Similar reasoning may apply to text that is italicized or underlined for tables in certain types of documents. Based on these characteristics, and other potential characteristics, the machine learning model 128 may determine that the words 406-424 are candidate table words. In certain instances, depending on the accuracy of the machine learning model 128, the words 404, 426 on the first and final corners of the table, as read from left to right, may be incorrectly classified as non-table words. Such an error may occur because, e.g., the word 404 comes at the beginning of the reading order 118 in the table 428, and the word 426 comes at the end of the reading order 118 for the table 428. Additionally, because the words 404, 426 are at the corners of the table, the words 404, 426 are not surrounded by as many table words as other words in the table (e.g., the word 412 has table words 406, 410, 414, 418 in all four directions). Therefore, although these words 404, 426 include other indications of table candidacy, the machine learning model 128 may incorrectly characterize them.

After identifying the candidate table words 406-424, the table recognizer 132 may identify a table location, signified by the candidate table border 432 reflected in FIG. 4B. In identifying the table location, the table recognizer 132 may define the candidate table border 432 using one of a number of strategies. As depicted, the candidate table border 432 may be defined as the rectangle with the smallest area that encloses all of the candidate table words 406-424. By defining the candidate table border 432 in this way, the table recognizer 132 may accurately include the words 404, 426 within the candidate table border 432, even though these words 404, 426 were not identified as candidate table words. Other table location identification procedures may be used, as discussed above.

By following the above-described procedure, the table detection system 110 can accurately identify a candidate table border 432 corresponding to the table 428 within the document 402, without including unnecessary text. Accordingly, after the table 428 has been accurately detected, the document 402 may be forwarded for additional processing by an automated document processing system.

FIG. 5 depicts a flow diagram of an example method 500 according to an example embodiment of the present disclosure. The flow diagram includes a training system 502, a feature set creator 504, a text classifier model 506, and a table recognizer 508. The training system 502 may be configured to orchestrate the operation of the method 500 and generate updated model parameters based on the outputs generated during the training, as detailed below. In some implementations, the training system 502 may be implemented as part of a table detection system 110 or a text classifier 126. The feature set creator 504 may be implemented as the feature set creator 120 and may be configured to create feature sets 122, 124, 244 based on a document 102, 402. The text classifier model 506 may be implemented as the machine learning model 128 of the text classifier 126 and may be configured to analyze feature sets 122, 124, 244 to classify text and identify the location of one or more tables 202, 428 within the document 102, 402. The table recognizer 508 may be implemented as the table recognizer 132 and may be configured to identify a table location, such as a candidate table border 134, 432, for tables 202, 428 located within a document 102, 402.

The method 500 may be used to train the text classifier model 506, which may be associated with the text classifier 126. Training the text classifier model 506 may improve the accuracy of the text classifier model 506 at detecting the location of one or more tables within the document 102, 402. Alternatively, training the text classifier model 506 may allow the text classifier model 506 to detect and identify the location of new tables 202, 428, such as tables 202, 428 with new types of formatting, or new information contained within the tables 202, 428. For example, the text classifier model 506 may be initially trained to detect tables 202, 428 in accounting documents (e.g., income statements) and, after completing the method 500, the text classifier model 506 may be able to recognize tables 202, 428 in financial documents (e.g., 10-K filings). In another example, the text classifier model 506 may be initially trained to detect tables 202, 428 containing numerical content (e.g., accounting/financial documents) and, after completing the method 500, the text classifier model 506 may be able to recognize tables containing written content (e.g., contact information provided in leases or other agreements). In some implementations, the method 500 may be performed more than once to train the text classifier model 506. In other implementations, the method 500 may only need to be performed once in order to properly train the text classifier model 506. A machine learning operator, such as an automated document processing system developer, may determine the number of times the method 500 is performed. Alternatively, the training system 502 may determine the number of times the method 500 is performed. For example, the training system 502 may repeat the method 500 until the text classifier model 506 is able to detect tables 202, 428 in a new document type or detect new types of tables 202, 428 with a particular level of accuracy.

The method 500 may be implemented on a computer system, such as the document intake system 104. For example, one or more steps of the method 500 may be implemented by the table detection system 110, including the reading order predictor 112, the feature set creator 120, the text classifier 126, and the table recognizer 132. The method 500 may also be implemented by a set of instructions stored on a computer-readable medium that, when executed by a processor, cause the computer system to perform the method. For example, all or part of the method 500 may be implemented by the CPU 136 and the memory 138. Although the examples below are described with reference to the flowchart illustrated in FIG. 5, many other methods of performing the acts associated with FIG. 5 may be used. For example, the order of some of the blocks may be changed, certain blocks may be combined with other blocks, one or more of the blocks may be repeated, and some of the blocks described may be optional.

Additionally, FIG. 5 depicts multiple communications between the training system 502, the feature set creator 504, the text classifier model 506, and the table recognizer 508. These communications may be transmissions between multiple pieces of hardware or may be exchanges between different programmatic modules of software. For example, the communications may be transmissions over a network between multiple computing systems, such as over the Internet or a local networking connection. These transmissions may occur over a wired or wireless interface. Other communications may be exchanges between software modules, performed through an application programming interface (API) or other established communication protocols.

The method 500 may begin with the training system 502 creating or selecting a training document and a training text (block 510). The training system 502 may create the training text by using an optical character recognizer 106 to extract text from a document 102, 402. Alternatively, the training system 502 may be connected to or contain a memory that stores training texts and may select one of the training texts for use in training the text classifier model 506. The training system 502 may create the training text based on the purpose for training the text classifier model 506. For example, if the text classifier model 506 is being trained to process a new document type, the training system 502 may create the training text to include text originating from or similar to the new document type. In another example, if the text classifier model 506 is being trained to improve its accuracy, the training system 502 may create a training text that includes particularly difficult table layouts. In a still further example, if the text classifier model 506 is being trained to identify a new table layout, the training system 502 may create a training text that includes a table or tables with the new table layout.

The feature set creator 504 may then create training feature sets based on the training text (block 512). As with the feature sets 122, 124, 244 discussed above, the training feature sets may be created based on a corresponding word or words of the training text and may include one or more text features or spatial features, including orthographic properties 246, syntactic properties 248, formatting properties 250, a nearby ruler line feature 252, neighbor alignment 254, and neighbor distance 256. In certain implementations, rather than the feature set creator 504 creating the training feature sets, the training system 502 may provide a training feature set based on the training document and training text created or selected at block 510.

The text classifier model 506 may analyze the training feature sets, the training text, and/or the training document (block 514) and identify candidate training table words in the training document and training text (block 516). As discussed above, the text classifier model 506 may analyze the training feature sets to identify candidate table words that are likely to be located within a training table of the training text. For example, the text classifier model 506 may analyze one or more features within the training feature sets for features suggesting presence within a training table of the training document. For example, the text classifier model 506 may analyze the text features for indications of certain formatting or orthographic properties 246, 250 (e.g., bolded text, numerical text) indicative of the corresponding word or words of the training text being located within a training table of the training document. In addition, the text classifier model 506 may analyze spatial features for indications of certain spatial information (e.g., consistent neighbor alignment and/or distance, nearby ruler lines in one or more directions) indicative of the corresponding word or words of the training text being contained within a training table of the training document. As discussed in examples above, the text classifier model 506 may analyze the training feature sets corresponding to words located near one another within the training document to more accurately identify candidate training table words, as training table words may occur near one another within the training document. In identifying the candidate training table words, the text classifier model 506 may estimate a probability that each word of the training text is contained within a table, and may include that estimation within a classified text for each of the analyzed words of the training text. The text classifier model 506, or a related text classifier 126, may identify the candidate training table words as those words of the training text that have an estimated probability of being contained within a table that exceeds a given threshold. This threshold may differ based on document type, and may be adjusted during the training process according to the method 500 in order to increase the text classifier model 506's accuracy at identifying candidate training table words within documents of a similar document type. The text classifier model 506 or text classifier 126 may indicate the candidate training table words in a classified training text, similar to the classified text 130 discussed above.

The table recognizer 508 may receive the candidate training table words (block 518). For example, the table recognizer 508 may receive the classified training text, including the estimated probabilities for each of the analyzed words of the training text, which may also further include a specific indicator (e.g., a Boolean indicator) of which words were identified as candidate training table words. Alternatively, the table recognizer 508 may only receive an indication of the words identified as candidate training table words.

The table recognizer 508 may then estimate the training table location based on the received candidate training table words (block 520). As discussed above, the table recognizer 508 may estimate the training table location using one or more table location estimation procedures, which may include identifying the rectangle with the smallest area that encloses all of the candidate training table words. In certain implementations, the table recognizer 508 may estimate the training table location as a candidate training table border of the training table. Prior to estimating the table location and/or the candidate training table border, the table recognizer 508 may analyze the received candidate training table words to identify a cluster or clusters of candidate training table words, where each cluster corresponds to candidate training table words likely originating from the same training table within the training document. The training system 502 may then receive the training table location (block 522). For example, the training system 502 may receive the estimated candidate training table border of the training table generated at block 520 by the table recognizer 508. The training system 502 may then compare the estimated training table location with a labeled table location generated by the training system 502, indicating the actual desired table location (e.g., the actual table border for the training table) for identification by the table recognizer 508 (block 524). The training system 502 may identify one or more errors in the training table location. For example, the estimated training table location may exclude one or more words that should be included within the estimated table location according to the labeled table location. Alternatively, the estimated training table location may include words that should not be included within the estimated candidate training table border. Further, the estimated training table location may include all of the desired words for inclusion within the training table, but may be larger than necessary (e.g., the candidate training table border for the estimated table location may not be the smallest possible rectangle).
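
The disclosure does not prescribe a particular error metric; one compact way to quantify all three error modes at once, offered here purely as an illustrative assumption, is an intersection-over-union comparison between the estimated and labeled borders:

```python
def table_location_error(estimated, labeled):
    """Sketch of the comparison at block 524: 1 - IoU between the
    estimated and labeled table borders (x0, y0, x1, y1). The error is
    0 for a perfect match and grows as words are wrongly included or
    excluded, or as the border is larger than necessary."""
    ex0, ey0, ex1, ey1 = estimated
    lx0, ly0, lx1, ly1 = labeled
    ix0, iy0 = max(ex0, lx0), max(ey0, ly0)
    ix1, iy1 = min(ex1, lx1), min(ey1, ly1)
    intersection = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((ex1 - ex0) * (ey1 - ey0)
             + (lx1 - lx0) * (ly1 - ly0) - intersection)
    return 1.0 - intersection / union if union > 0 else 1.0
```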

After identifying these errors in the estimated training table location, the training system 502 may generate updated model parameters (block 526). The updated model parameters may be generated to improve the accuracy of the text classifier model 506 by, for example, improving the accuracy of the text classifier model 506 at estimating the training table location within a document based on the feature sets, or by improving the accuracy of the table recognizer 508 at estimating the training table location based on the candidate training table words received from the text classifier model 506. The updated model parameters may be generated by, for example, adjusting the weights assigned to particular features included in the training feature sets, or by adjusting whether and to what extent the text classifier model 506 analyzes feature sets for nearby words, rather than just for each word individually. For example, if the text classifier model 506 is being trained to identify tables in which ruler lines are uncommon, the text classifier model 506 may be updated with parameters that deemphasize the importance of the nearby ruler line feature 252. In this example, the feature set creator 504 may also be updated to no longer calculate a nearby ruler line feature 252 for inclusion within the training feature sets at block 512. The updated model parameters may also be generated by, for example, altering the strategy by which the table recognizer 508 estimates a candidate training table border, or identifies one or more clusters of the candidate training table words. For example, if the method 500 is being performed to improve accuracy at estimating the location of tables within a document type that frequently contains more than one table per page, the table recognizer 508 may be updated with a different cluster identification method that is better able to identify more than one cluster of candidate training table words within a single page. Additionally, the updated model parameters may include an adjustment of the threshold used by the text classifier model 506 to identify candidate training table words based on the estimated probability of location within a training table.
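
The disclosure frames the update in terms of table location error; how that error propagates into parameter updates is not specified. The following sketch is therefore a hypothetical simplification: it derives per-word labels from the labeled table border and updates a word-level classifier (such as the TableWordClassifier sketch above) with binary cross-entropy:

```python
import torch
import torch.nn as nn

def training_step(model, optimizer, feature_sets, word_boxes, labeled_border):
    """Hypothetical word-level surrogate for blocks 514-526: label each
    word 1 if its box (x0, y0, x1, y1) falls inside the labeled table
    border, then update the classifier by gradient descent."""
    x0, y0, x1, y1 = labeled_border
    labels = torch.tensor([[float(b[0] >= x0 and b[1] >= y0 and
                                  b[2] <= x1 and b[3] <= y1)
                            for b in word_boxes]])
    probs = model(feature_sets)  # shape: (1, num_words)
    loss = nn.functional.binary_cross_entropy(probs, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```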

The feature set creator 504, text classifier model 506, and table recognizer 508 may then receive the updated model parameters and be updated to incorporate the updated model parameters (block 528). The method 500 may then repeat, beginning again at block 510, to further train the model as discussed above. Similarly, blocks 510-524 may be repeated in whole or in part, e.g., based on multiple training documents and training texts, before generating updated model parameters at block 526.

All of the disclosed methods and procedures described in this disclosure can be implemented using one or more computer programs or components. These components may be provided as a series of computer instructions on any conventional computer-readable medium or machine-readable medium, including volatile and non-volatile memory, such as RAM, ROM, flash memory, magnetic or optical disks, optical memory, or other storage media. The instructions may be provided as software or firmware, and may be implemented in whole or in part in hardware components such as ASICs, FPGAs, DSPs, or any other similar devices. The instructions may be configured to be executed by one or more processors which, when executing the series of computer instructions, perform or facilitate the performance of all or part of the disclosed methods and procedures.

It should be understood that various changes and modifications to the examples described here will be apparent to those skilled in the art. Such changes and modifications can be made without departing from the spirit and scope of the present subject matter and without diminishing its intended advantages. It is therefore intended that such changes and modifications be covered by the appended claims.

CLAIMS

1. A method comprising: receiving a text of a document image that includes a plurality of words depicted in the document image; calculating a plurality of feature sets for the plurality of words, wherein each feature set contains information indicative of one or more features of a corresponding word of the plurality of words; identifying candidate table words among the plurality of words based on the feature sets; identifying, with a clustering procedure, a cluster of candidate table words that correspond to a table within the document image; and defining a candidate table location including a candidate table border of the table that contains the cluster of candidate table words.

2. The method of claim 1, wherein the features of the feature set include one or more text features selected from the group consisting of: orthographic properties of the corresponding word, syntactic properties of the corresponding word, and formatting properties of the corresponding word.

3. The method of claim 1, wherein the features of the feature set include one or more spatial features selected from the group consisting of: a nearby ruler line distance, a neighbor alignment measurement, and a neighbor distance measurement.

4. The method of claim 1, wherein the candidate table border is defined as the rectangle with the smallest area that contains the cluster of candidate table words.

5. The method of claim 1, wherein the clustering procedure is a density-based spatial clustering of applications with noise (DBSCAN) procedure.
6. The method of claim 1, wherein the candidate table border contains one or more words of the text that are not candidate table words.

7. The method of claim 1, further comprising: predicting a reading order for at least a subset of the words of the text.

8. The method of claim 7, wherein the text further includes a location for the subset of words, and wherein predicting the reading order further comprises: assigning a first word of the subset of words as coming before a second word of the subset of words in the reading order if one or more of the following conditions are true: (i) the second word is below the first word according to the location of the first and second words, or (ii) the second word is at the same height as the first word and is positioned to the right of the first word according to the location.

9. The method of claim 1, wherein the candidate table words are identified using a machine learning model.

10. The method of claim 9, wherein the machine learning model is a recurrent neural network or a convolutional neural network.

11. The method of claim 9, further comprising: receiving a training text of a training document, including a plurality of words depicted in the training document and a labeled document image indicating a labeled table location of a table within the training document; calculating a plurality of training feature sets for the words of the training text, wherein each training feature set contains information indicative of one or more features of a corresponding word of the plurality of words of the training text; identifying, with the machine learning model, candidate training table words of the training text among the words of the training text based on the training feature sets; identifying, with the clustering procedure, a cluster of candidate training table words that correspond to a table within the document image; defining a candidate training table location including a candidate training table border of the training table that contains the cluster of candidate training table words; comparing the training table location with the labeled table location to identify a table location error of the training table location; and updating one or more parameters of the machine learning model based on the table location error.
12. A system comprising: a processor; and a memory containing instructions that, when executed by the processor, cause the processor to: receive a text of a document image that includes a plurality of words depicted in the document image; calculate a plurality of feature sets for the words, wherein each feature set contains information indicative of one or more features of a corresponding word of the plurality of words; identify candidate table words among the plurality of words based on the feature sets; identify, with a clustering procedure, a cluster of candidate table words that correspond to a table within the document image; and define a candidate table location including a candidate table border of the table that includes the cluster of candidate table words.

13. The system of claim 12, wherein the features of the feature set include one or more text features selected from the group consisting of: orthographic properties of the corresponding word, syntactic properties of the corresponding word, and formatting properties of the corresponding word.

14. The system of claim 12, wherein the features of the feature set include one or more spatial features selected from the group consisting of: a nearby ruler line distance, a neighbor alignment measurement, and a neighbor distance measurement.

15. The system of claim 12, wherein the candidate table border is defined as the rectangle with the smallest area that contains the cluster of candidate table words.

16. The system of claim 12, wherein the memory contains further instructions which, when executed by the processor, cause the processor to: predict a reading order for at least a subset of the words of the text.

17. The system of claim 16, wherein the text further includes a location for the subset of words, and wherein the memory contains further instructions which, when executed by the processor, cause the processor to: assign a first word of the subset of words as coming before a second word of the subset of words in the reading order if one or more of the following conditions are true: (i) the second word is below the first word according to the location of the first and second words, or (ii) the second word is at the same height as the first word and is positioned to the right of the first word according to the location.

18. The system of claim 12, wherein the text classifier includes a machine learning model configured to identify the candidate table words, the machine learning model including at least one of a recurrent neural network and a convolutional neural network.

19. The system of claim 18, further comprising a training system configured, when executed by the processor, to: receive a training text of a training document, including a plurality of words depicted in the training document and a labeled document image indicating a labeled table location of a table within the training document; calculate a plurality of training feature sets for the words of the training text, wherein each training feature set contains one or more features of a corresponding word of the plurality of words of the training text; identify, with the machine learning model, candidate training table words of the training text among the words of the training text; identify, with the clustering procedure, a cluster of candidate training table words that correspond to a table within the document image; define a candidate training table location including a candidate training table border of the training table that contains the cluster of candidate training table words; compare the training table location with the labeled table location to identify a table location error of the training table location; and update one or more parameters of the machine learning model based on the table location error.

20. A computer-readable medium containing instructions which, when executed by a processor, cause the processor to: receive a text of a document image that includes a plurality of words depicted in the document image; calculate a plurality of feature sets for the words, wherein each feature set contains information indicative of one or more features of a corresponding word of the plurality of words; identify candidate table words among the plurality of words based on the feature sets; identify, with a clustering procedure, a cluster of candidate table words that correspond to a table within the document image; and define a candidate table location including a candidate table border of the table that includes the cluster of candidate table words.